Statistical learning involves a vast set of tools for understanding data
These tools fall into two broad categories: supervised and unsupervised
We will use these tools to build statistical models for predicting or estimating an output based on one or more inputs
Outcome measurement: Y
Vector of p predictor measurements: X
In the regression problem, Y is quantitative (e.g., price, blood pressure, temperature)
In the classification problem, Y takes values in a finite, unordered set (e.g., survived/died, digit 0-9, cancer class of tissue sample)
We have training data \((x_1, y_1), \ldots, (x_N, y_N)\): N observations of these measurements
On the basis of the training data we would like to:
Accurately predict unseen test cases
Understand which inputs (independent variables) affect the outcome (dependent variable) and how
Assess the quality of our predictions and inferences
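The three goals above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method prescribed by these notes: the data are synthetic (generated as y = 2 + 3x + noise), the model is ordinary least squares with a single predictor, and the helper names `fit_simple_linear_regression` and `mean_squared_error` are hypothetical.

```python
import random


def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y ~ b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def mean_squared_error(b0, b1, xs, ys):
    """Average squared gap between observed y and predicted b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# The first 150 observations are training data; the last 50 play the role
# of unseen test cases.
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]

# Fit on training data only. The slope b1 describes how the input affects
# the outcome.
b0, b1 = fit_simple_linear_regression(x_train, y_train)

# Assess prediction quality on data the model has not seen.
test_mse = mean_squared_error(b0, b1, x_test, y_test)
print(f"intercept={b0:.2f}, slope={b1:.2f}, test MSE={test_mse:.2f}")
```

Because the error is measured on held-out observations rather than on the training data, it gives a more honest picture of how the model would do on new cases.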
It is important to understand the ideas behind the various techniques, in order to know how and when to use them
One has to understand the simpler methods first, in order to grasp the more sophisticated ones
It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones)
This is an exciting research area, having important applications in science, industry and finance
Statistical learning is a fundamental ingredient in the training of a modern data scientist
Unsupervised learning: no outcome (dependent) variable
The objective is fuzzier
Find groups of samples that behave similarly
Find features (predictors) that behave similarly
Find linear combinations of features (predictors, independent variables) with the most variation
Difficult to know how well you are doing
Different from supervised learning, but can be useful as a pre-processing step for supervised learning
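As one illustration of finding groups of samples that behave similarly, here is a minimal k-means sketch in plain Python. The notes do not name a specific algorithm, so k-means is an assumption chosen for illustration; the data are synthetic, and the function names `squared_distance` and `k_means` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, n_iter=20, seed=0):
    """Basic k-means: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen samples
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Synthetic data (an assumption for illustration): two well-separated
# groups of 30 samples each. Note that no outcome variable Y is used.
rng = random.Random(1)
group_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
group_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
points = group_a + group_b

labels, centers = k_means(points, k=2)
print("cluster sizes:", [labels.count(j) for j in range(2)])
print("centers:", [tuple(round(c, 2) for c in center) for center in centers])
```

Here the true grouping is known only because the data were simulated. With real data there is no such reference, which is why it is difficult to know how well an unsupervised method is doing.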